An Introduction to Python Dashboards

Marc Dotson

2025-02-28

  • Provide some guidance for the datathon
  • Focus on Python beginners
  • Assume you’ve installed Python (and an IDE)
  • Focus is on three specific data tools
  • Special focus on human-readable code
  • Need help? Raise your hand or talk after

Polars

  • New data wrangling library
  • Alternative to Pandas for using DataFrames
  • Fast – in Rust, uses Apache Arrow, built to parallelize and use GPUs, allows for lazy evaluation
  • More consistent syntax than Pandas
  • Anagram of its query engine (OLAP) and Rust (rs)

Filter, slice, and sort observations

Polars syntax follows a human-readable, SQL-like grammar

  • Install with pip install polars
  • Parameters for .slice() are the start index and length
  • We use pl.col() to reference variables in our data
import polars as pl
import os

customer_data = pl.read_csv(os.path.join('data', 'customer_data.csv'))
customer_data.shape
customer_data.columns

customer_data.filter(pl.col('college_degree') == 'Yes')
customer_data.filter(pl.col('region') != 'West')
customer_data.filter(pl.col('gender') != 'Female', pl.col('income') > 70000)

customer_data.slice(0, 5)

customer_data.sort(pl.col('birth_year'))
customer_data.sort(pl.col('birth_year'), descending = True)

Select and recode/create variables, join data frames

Polars .filter() and .select() are separate methods

customer_data.select(pl.col('region'), pl.col('review_text'))
customer_data.select(pl.col(['region', 'review_text']))

customer_data.with_columns(income_new = pl.col('income') / 1000)

store_transactions = pl.read_csv(os.path.join('data', 'store_transactions.csv'))

customer_data.join(store_transactions, on = 'customer_id', how = 'left')
customer_data.join(store_transactions, on = 'customer_id', how = 'inner')

Chain methods to supercharge human-readability

Polars embraces method chaining to improve efficiency

  • The entire chain needs to be surrounded with ( )
  • Each line starts with .
  • You run the whole block of code at once
  • The consistent syntax can be read like a sentence
(customer_data
  .join(store_transactions, on = 'customer_id', how = 'left')
  .filter(pl.col('region') == 'West', pl.col('feb_2005') == pl.col('feb_2005').max())
  .with_columns(age = 2025 - pl.col('birth_year'))
  .select(pl.col(['age', 'feb_2005']))
  .sort(pl.col('age'), descending = True)
  .slice(0, 1)
)

Summarize variables, including grouped summaries

Human-readable code is designed to be consistent at the cost of being verbose

(customer_data
  .select(pl.col('income'))
  .mean()
)

(customer_data
  .group_by(pl.col(['region', 'college_degree']))
  .agg(n = pl.len())
)

(customer_data
  .group_by(pl.col(['gender', 'region']))
  .agg(
    avg_income = pl.col('income').mean(), 
    avg_credit = pl.col('credit').mean()
  )
  .sort(pl.col('avg_income'), descending = True)
)

Optimize underlying queries with lazy evaluation

Tag a data frame with .lazy() to have Polars optimize the query

  • Code is only run when we use .collect()
  • Look at the underlying optimized OLAP query using .explain()
df = (customer_data
  .group_by(pl.col(['gender', 'region']))
  .agg(
    n = pl.len(), 
    avg_income = pl.col('income').mean(), 
    avg_credit = pl.col('credit').mean()
  )
  .sort(pl.col('avg_income'), descending = True)
).lazy()

df.collect()

df.explain()

seaborn.objects

  • New (under development) seaborn module for visualizing data
  • Like seaborn, it’s a higher-level interface for matplotlib
  • Unlike seaborn, it attempts to eliminate the need to fine-tune plots with matplotlib
  • More consistent syntax following the grammar of graphics
  • Each plot is composed of data, a mapping, and the specific graphic

Column plots

With a focus on human-readability, seaborn.objects relies on method chaining and is consistent and verbose

  • Install with pip install seaborn.objects
  • We can’t method chain between object classes
  • Data and the mapping happen with so.Plot()
  • The graphic is specified with so.Bar()
import seaborn.objects as so

# Option 1: Plot + Mark
region_count = (customer_data
  .group_by(pl.col('region'))
  .agg(n = pl.len())
)

(so.Plot(region_count, x = 'region', y = 'n')
  .add(so.Bar())
)

# Option 2: Plot + Mark + Stat
(so.Plot(customer_data, x = 'region')
  .add(so.Bar(), so.Hist())
)

Histograms

seaborn.objects provides flexibility through combining Marks and Stats

(so.Plot(customer_data, x = 'income')
  .add(so.Bars(), so.Hist())
)

Scatterplots

Each Mark and Stat has its own parameters to customize your plots

(so.Plot(customer_data, x = 'star_rating', y = 'income')
  .add(so.Dot(pointsize = 10, alpha = 0.5), so.Jitter(0.75))
)

Line plots

A Stat is not always sufficient – we often need to iterate between data wrangling and visualization

(so.Plot(customer_data, x = 'review_time', y = 'star_rating')
  .add(so.Line())
)

Line plots

rating_data = (customer_data
  .drop_nulls(pl.col('star_rating'))
  .select(pl.col(['review_time', 'star_rating']))
  .with_columns(
    pl.col('review_time').str.to_date(format='%m %d, %Y').alias('review_time')
  )
  .with_columns(
    pl.col('review_time').dt.year().alias('review_year')
  )
  .group_by('review_year')
  .agg(pl.mean('star_rating').alias('avg_star_rating'))
)

(so.Plot(rating_data, x = 'review_year', y = 'avg_star_rating')
  .add(so.Line())
)

Facets

There are lots of things you can do to customize a plot beyond the data, mapping, and graphic type

  • The .facet() method behaves like a .group_by() for the plot
  • You can .facet() by mapping different variables to col and row
(so.Plot(customer_data, x = 'income')
  .facet(col = 'gender')
  .add(so.Bars(), so.Hist())
)

Quarto

  • Technical typsetting and publishing system
  • Lightweight alternative/complement to Jupyter notebooks
  • Using Pandoc, renders to HTML, PDF, Word, PowerPoint, .ipynb, dashboards, etc.
  • Reference to a book resulting from folding printed sheets into four leaves

Quarto basics

Install the Quarto extension in VS Code (it comes pre-installed with Positron) or download and install the CLI

  • At the top of each .qmd file is a header written in YAML
  • Use markdown syntax outside of code blocks (i.e., ## to create section headings and - to produce lists)
  • The markdown is interpreted differently depending on the format you want to render to as specified in the header
  • Code cells begin with ‘‘‘{python} and end with ‘‘‘
  • If you don’t remember markdown syntax, Quarto has a visual editor
  • Pay attention to required white space
  • Render the document to produce the specified output